Using requests in Python Web Crawlers
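I. Crawler basics

How a crawler works:
    What is a crawler?
        A crawler is a program that fetches data from websites.

    What is the Internet?
        A collection of network devices that connect individual computers to one another.

    Why was the Internet built?
        To transmit and share data.
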
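What happens when you browse the web:
    - An ordinary user
        opens a browser --> sends a request to the target site --> receives the response data --> the browser renders it as a page.

    - A crawler program
        imitates a browser --> sends a request to the target site --> receives the response data --> extracts the useful data --> saves it locally or to a database.
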
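The crawler flow above can be sketched in a few lines. This is only a minimal illustration, not a complete crawler: the function names (`fetch`, `extract_title`, `save`, `crawl`) are made up for this example, and extracting the page title stands in for "the useful data".

```python
import re

import requests

# A browser-like User-Agent so the request looks like it comes from a browser.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
}


def fetch(url):
    """Send the request to the target site and return the response body as text."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def extract_title(html):
    """Extract the useful data; here that is just the <title> of the page."""
    match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return match.group(1).strip() if match else None


def save(text, path):
    """Save the extracted data to a local file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def crawl(url, path):
    """Run the whole flow: request -> response -> extract -> save."""
    title = extract_title(fetch(url))
    save(title or "", path)
    return title
```

Calling `crawl('https://www.baidu.com/', 'title.txt')` would run all the steps against a real site; the file name is arbitrary.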
What kind of request does a browser send?
    An HTTP request, which consists of:
        - the request URL
        - the request method:
            GET, POST

        - the request headers:
            cookies
            user-agent
            host

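To make these parts concrete, the sketch below assembles the text of a simple HTTP/1.1 GET request from a URL, a method and a header dictionary. It uses only the standard library; `build_raw_request` is a hypothetical helper written for this illustration, and the cookie value is a placeholder. A real client such as a browser or requests sends more headers than this.

```python
from urllib.parse import urlsplit


def build_raw_request(method, url, headers):
    """Return the text of a body-less HTTP/1.1 request for the given URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # Request line first, then the Host header, then the remaining headers.
    lines = [f"{method} {path} HTTP/1.1", f"Host: {parts.netloc}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # A blank line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


raw = build_raw_request(
    "GET",
    "https://www.baidu.com/",
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36",
        "Cookie": "name=value",  # placeholder, not a real cookie
    },
)
print(raw)
```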
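The full crawl process:
    1. Send the request (request libraries)
        - the requests module
        - the selenium module

    2. Receive the response data (returned by the server)

    3. Parse and extract the data (parsing libraries)
        - re (regular expressions)
        - bs4 (BeautifulSoup4)
        - XPath

    4. Save the data (storage)
        - MongoDB

    Steps 1, 3 and 4 are the ones you have to write yourself.
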
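As a rough illustration of steps 3 and 4, the sketch below parses a small hand-written HTML fragment with `re` and saves the result. The sample HTML, the function names and the JSON file are all assumptions made for this example; in particular, the JSON file is only a stand-in for a real store such as MongoDB.

```python
import json
import os
import re
import tempfile

# A made-up HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li><a href="/post/1">First post</a></li>
  <li><a href="/post/2">Second post</a></li>
</ul>
"""

# Non-greedy groups capture the link target and the link text.
LINK_PATTERN = re.compile(r'<a href="(.*?)">(.*?)</a>', re.S)


def parse_links(html):
    """Step 3: extract one {"url", "title"} record per link."""
    return [
        {"url": url, "title": title.strip()}
        for url, title in LINK_PATTERN.findall(html)
    ]


def save_records(records, path):
    """Step 4: persist the records (a JSON file stands in for a database)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


records = parse_links(SAMPLE_HTML)
output_path = os.path.join(tempfile.gettempdir(), "links.json")
save_records(records, output_path)
print(records)
```

Regular expressions are fine for small, regular fragments like this one; for real pages a parser such as bs4 or XPath is usually less fragile.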
- Crawler frameworks
    Scrapy (object-oriented)

Using the Chrome browser tools:
    Open Developer Tools --> Network --> enable "Preserve log" and "Disable cache"
II. Installing the requests library
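1. Run "pip3 install requests" in a command prompt (DOS) to install it.
2. Alternatively, install it from within PyCharm.

1. The HTTP protocol (using a request to Baidu as the example)
    (1) Request URL:
        https://www.baidu.com/
    (2) Request method:
        GET
    (3) Request headers:
        Cookie: may need attention.
        User-Agent: tells the server that the client is a browser.
            Note: copy the value from the request headers shown in your browser, for example
            Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36
        Host: www.baidu.com

2. Using the browser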
3. Ways of using requests

    >>> import requests
    >>> r = requests.get('https://api.github.com/events')
    >>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
    >>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
    >>> r = requests.delete('http://httpbin.org/delete')
    >>> r = requests.head('http://httpbin.org/get')
    >>> r = requests.options('http://httpbin.org/get')

4. Crawling the Baidu home page

    import requests

    response = requests.get(url='https://www.baidu.com/')
    response.encoding = 'utf-8'
    print(response)  # <Response [200]>
    # Response status code
    print(response.status_code)  # 200
    # Response body as text
    # print(response.text)
    print(type(response.text))  # <class 'str'>
    # Write the fetched content to baidu.html
    with open('baidu.html', 'w', encoding='utf-8') as f:
        f.write(response.text)

IV. GET requests

1. Using request headers (visiting Zhihu Explore as the example)

(1) Fetching the page directly fails:

    # Visit Zhihu Explore
    import requests

    response = requests.get(url='https://www.zhihu.com/explore')
    print(response.status_code)  # 400
    print(response.text)  # returns an error page

(2) Once a request header is added, the page can be fetched normally:

    # Visit Zhihu with request headers
    import requests

    # Request header dictionary
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    # Pass the user-agent header to the GET request
    response = requests.get(url='https://www.zhihu.com/explore', headers=headers)
    print(response.status_code)  # 200
    # print(response.text)
    with open('zhihu.html', 'w', encoding='utf-8') as f:
        f.write(response.text)

2. The params request parameter

(1) Some sites use very long URLs containing a long run of unreadable, percent-encoded characters. In that case you can pass the query values through the params argument instead of writing them into the URL.

    import requests
    from urllib.parse import urlencode

    # Example: searching Baidu for "蔡徐坤"
    # url = 'https://www.baidu.com/s?wd=%E8%94%A1%E5%BE%90%E5%9D%A4'
    '''
    Method 1:
    url = 'https://www.baidu.com/s?' + urlencode({"wd": "蔡徐坤"})
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    '''
    # Method 2:
    url = 'https://www.baidu.com/s?'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    # Pass the query values to get() through params
    response = requests.get(url, headers=headers, params={"wd": "蔡徐坤"})
    # The URL that was actually requested, e.g.
    # https://www.baidu.com/s?wd=%E8%94%A1%E5%BE%90%E5%9D%A4
    print(response.url)
    # print(response.text)
    with open('xukun.html', 'w', encoding='utf-8') as f:
        f.write(response.text)

3. Using the cookies parameter

(1) Sending login cookies to get past GitHub's login check

    Request URL:
        https://github.com/settings/emails

    Request method:
        GET

    Request headers:
        User-Agent

        Cookie: has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60

Method 1: put the cookies in the request headers

    import requests

    # Request URL
    url = 'https://github.com/settings/emails'

    # Request headers
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
        # Put the cookie string copied from the browser into the request headers
        'Cookie': 'has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60'
    }
    github_res = requests.get(url, headers=headers)

Method 2: pass the cookies as an argument to get()

    import requests

    url = 'https://github.com/settings/emails'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
    # Cookie string copied from the browser
    cookie_string = 'has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60'
    # The cookies argument expects one dictionary entry per cookie,
    # so split the string into name/value pairs
    cookies = dict(pair.split('=', 1) for pair in cookie_string.split('; '))

    github_res = requests.get(url, headers=headers, cookies=cookies)

    # True if the phone number bound to the account appears in the page,
    # which means the login cookies were accepted
    print('15622792660' in github_res.text)

V. POST requests

1. GET and POST

(1) GET requests (GET is the default HTTP request method):
    * There is no request body.
    * The data travels in the URL, so its size is limited in practice by the URL-length limits of browsers and servers (the HTTP specification itself sets no fixed limit).
    * The request data is exposed in the browser's address bar.

(2) Common operations that send a GET request:
    1. Entering a URL directly in the browser's address bar always sends a GET request.
    2. Clicking a hyperlink on a page also sends a GET request.
    3. Submitting a form uses GET by default, but the form can be set to use POST.

(3) POST requests:
    (1) The data does not appear in the address bar.
    (2) The protocol sets no upper limit on the size of the data.
    (3) There is a request body.
    (4) If a form-encoded request body contains Chinese characters, they are URL-encoded.

requests.post() is used in the same way as requests.get(). The difference is that requests.post() takes a data argument, which holds the request body.

2. Logging in to GitHub automatically with a POST request

When analyzing a login flow, deliberately enter a wrong username or password in the login form and then capture the traffic. If you enter the correct credentials, the browser redirects immediately and the login request becomes very hard to find in the captured packets.

    '''
    Log in to GitHub automatically with a POST request.
    GitHub's anti-crawling measures:
        1. The login request to /session must carry the cookies returned by the login page.
        2. The emails page must carry the cookies returned by the /session request.
    '''

    import re

    import requests

    # Step 1: visit the login page to get the authenticity_token
    login_url = 'https://github.com/login'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Referer': 'https://github.com/'
    }
    login_res = requests.get(login_url, headers=headers)
    # print(login_res.text)
    authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', login_res.text, re.S)[0]
    # print(authenticity_token)
    login_cookies = login_res.cookies.get_dict()

    # Step 2: send a POST request to /session with the token in the request body
    session_url = 'https://github.com/session'

    session_headers = {
        'Referer': 'https://github.com/login',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    }

    form_data = {
        "commit": "Sign in",
        "utf8": "✓",
        "authenticity_token": authenticity_token,
        "login": "username",
        "password": "githubpassword",
        'webauthn-support': "supported"
    }

    # Step 3: check whether the login succeeded
    session_res = requests.post(
        session_url,
        data=form_data,
        cookies=login_cookies,
        headers=session_headers,
        # allow_redirects=False
    )

    session_cookies = session_res.cookies.get_dict()

    url3 = 'https://github.com/settings/emails'
    email_res = requests.get(url3, cookies=session_cookies)

    # '账号' ("account") is a placeholder: replace it with text that only
    # appears on the page when you are logged in
    print('账号' in email_res.text)

Automatic GitHub login (handling the cookies manually)

VI. The response

1. Response attributes
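The attributes most often read from a response can be tried out as in the sketch below. To avoid depending on an external site, it starts a tiny throwaway HTTP server on 127.0.0.1 using the standard library; `DemoHandler`, the page body and the `demo` cookie are all made up for this illustration. The attributes themselves (`status_code`, `url`, `headers`, `encoding`, `text`, `content`, `cookies`, `history`) belong to the response object that requests returns.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Minimal local handler so the example needs no external site."""

    def do_GET(self):
        body = "<html><body>hello</body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Set-Cookie", "demo=1")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/"

response = requests.get(url, timeout=5)

print(response.status_code)              # numeric status code
print(response.url)                      # final URL after any redirects
print(response.headers["Content-Type"])  # response headers (case-insensitive keys)
print(response.encoding)                 # encoding used to decode .text
print(response.text)                     # body decoded to str
print(response.content)                  # raw body as bytes
print(response.cookies.get_dict())       # cookies set by the server
print(response.history)                  # earlier responses if redirects occurred

server.shutdown()
server.server_close()
```

Against a real site the same attributes apply; only the URL passed to `requests.get()` changes.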